
Add example using xarray atmospheric data #32

Draft · wants to merge 6 commits into main
Conversation

@jni jni commented Nov 27, 2024

Example to view xarray atmospheric model data in napari.

The example data is too large to display in our gallery, so I'm making the PR
against my own repo for discussion. We could try the same approach with a
smaller dataset, or with a dataset in a cloud-native format like zarr, so that
the gallery would only need to load a single timepoint.

Code to download the data in this example:

# download model prediction data
curl -O https://thredds.nci.org.au/thredds/fileServer/wr45/ops_aps3/access-g/1/20241104/0000/fc/ml/air_temp.nc
curl -O https://thredds.nci.org.au/thredds/fileServer/wr45/ops_aps3/access-g/1/20241104/0000/fc/ml/spec_hum.nc

# download corresponding 10 days' worth of measurements
mkdir an && cd an  # use 'an' folder for single time points
for day in 04 05 06 07 08 09 10 11 12 13; do
  for hour in 00 06 12 18; do
    curl https://thredds.nci.org.au/thredds/fileServer/wr45/ops_aps3/access-g/1/202411${day}/${hour}00/an/ml/spec_hum.nc -o ${day}-${hour}-spec_hum.nc
    curl https://thredds.nci.org.au/thredds/fileServer/wr45/ops_aps3/access-g/1/202411${day}/${hour}00/an/ml/air_temp.nc -o ${day}-${hour}-air_temp.nc
  done
done

As noted in the example itself, napari's axis order ([plane], row, column),
with the origin at the top left, matches NumPy arrays but is unsuitable for
latitude data, which starts at 90 at the top and ends at -90 at the bottom.
The latitude values are therefore negated for plotting; longer term, napari
should gain the ability to describe the geometry of world space relative to
canvas space.
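A rough sketch of the negation idea, using only NumPy (the coordinate values and the final `viewer.add_image` call are illustrative, not napari's confirmed behaviour):

```python
import numpy as np

# Hypothetical lat/lon coordinate vectors, as xarray would expose them;
# latitude runs from the North Pole (90) down to the South Pole (-90).
lat = np.linspace(90, -90, 181)
lon = np.linspace(0, 359, 360)

# Negate latitude so that it increases with napari's row axis; the pixel
# spacing then becomes positive and the world-space origin moves to -90.
neg_lat = -lat
scale = [neg_lat[1] - neg_lat[0], lon[1] - lon[0]]  # per-axis spacing
translate = [neg_lat[0], lon[0]]                    # world-space origin

# In the viewer this would be passed along these lines (untested sketch):
# viewer.add_image(data, scale=scale, translate=translate)
```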

jni commented Nov 27, 2024

@tennlee here's an example of grabbing the coordinate metadata from xarray and passing it to napari.

Unfortunately, I'm finding the resampling in xarray to be suuuuper slow. Here's what I mean:

In [6]: ds = xr.open_dataset('spec_hum.nc', chunks={'time': 1})
/Users/jni/micromamba/envs/all/lib/python3.11/site-packages/xarray/core/dataset.py:282: UserWarning: The specified chunks separate the stored chunks along dimension "time" starting at index 1. This could degrade performance. Instead, consider rechunking after loading.

In [26]: ds_reg = ds.interp(
    ...:     coords={'time': np.arange(np.array(ds.time[0]), np.array(ds.time[-1]),
    ...:                               np.array(np.diff(ds.time[:2]))[0])},
    ...:     method='nearest')

In [39]: %timeit arr = np.asarray(ds_reg.spec_hum[0])
16.4 s ± 214 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [40]: %timeit arr = np.asarray(ds.spec_hum[0])
518 ms ± 94.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [41]: %timeit arr = np.asarray(ds.spec_hum[0, 0])
124 ms ± 9.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [42]: %timeit arr = np.asarray(ds_reg.spec_hum[0, 0])
34.4 s ± 7.88 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [43]: %%timeit
    ...: arr = np.asarray(ds.spec_hum[0, 0])
    ...: idxs = np.meshgrid(np.arange(arr.shape[0]), np.arange(arr.shape[1]),
    ...:                    indexing='ij')
    ...: arr_res = ndi.map_coordinates(arr, idxs, order=0)
486 ms ± 25.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

What's happening behind the scenes in napari is:

  • loading the dataset with chunks means that the xarray DataArray is backed by dask
  • when you move a slider in napari, napari materialises the requested slice using np.asarray(arr[slice_info])
  • this is currently super slow, for reasons I don't understand
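The lazy-then-materialise pattern in the bullets above can be sketched with plain dask (the shapes and chunking here are illustrative):

```python
import numpy as np
import dask.array as da

# Simulate a (time, lat, lon) variable backed by dask, as xarray produces
# when a dataset is opened with chunks={'time': 1}.
arr = da.random.random((10, 180, 360), chunks=(1, 180, 360))

# Slicing stays lazy -- no data is read or computed yet:
lazy_slice = arr[0]

# Only this call materialises the slice; it is the step napari performs
# whenever a slider moves, and where any upstream dask graph gets paid for.
frame = np.asarray(lazy_slice)
```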

(it's also annoying that the model data starts at 0100 and the measurements at 0000 😂)

It's still usable if you turn on async mode in napari, which you can do either with the NAPARI_ASYNC=1 environment variable, or by setting the experimental > render images asynchronously checkbox in the preferences (Cmd+, on Mac when the viewer is in focus).

Overall though, this is a very cool dataset. I like that it shows off napari's ability to overlay data with different time steps and extents, too. We could also load the temperature volumes and treat them the same way, and make them invisible by default. (Pass visible=False to the layer.)

Ideally, I'd like to save (a) the model resampled at 1h intervals, and (b) the measurements, as geozarr or some other zarr-backed format that xarray can read natively. If we put that online somewhere useful, this example could go into the napari sample gallery (which must be able to be built without downloading a massive dataset).

Any ideas?

jni commented Nov 27, 2024

Oh I forgot to add some screenshots:
[Screenshot 2024-11-27 at 2:07:57 PM]
[Screenshot 2024-11-27 at 2:59:40 PM]

jni commented Nov 28, 2024

Ah, @tennlee, I figured out why the sampling is uneven in the raw data, now that I've displayed it without resampling in its own viewer. If you hit play on the first viewer (the grayscale one), you'll see that partway through the playback it speeds up. So the time interval increases as you get further into the model run, which I guess makes sense, since the model wouldn't be able to achieve hourly precision by that point anyway?

tennlee commented Nov 28, 2024

I might have to get to this on the weekend. Dealing with 3 days of backlog apparently takes time. :)

tennlee commented Nov 28, 2024

Just a note on something I thought would be neat to try. If there's a callback option in napari, it would be interesting to install scores locally and register a verification calculation for overlay/visualisation against the main data. I haven't thought through how that would work, but it may be more interactive than doing all the data calculation up front, particularly if people are reviewing the data investigatively. I'm not saying it's important; I just wanted to share the idea so it's not locked in my head only.

jni commented Nov 29, 2024

If there's a callback option in napari, it would be interesting to try to install scores into a local install and then register a verification calculation for overlay/visualisation against the main data.

Yes, this is easy to do! You can call viewer.dims.events.point.connect(callback), which will fire whenever you move the sliders, with the value of the current point in "real" (world) coordinates.

Having said that, in this case I would probably define a dask array based on the scores computation and display that, rather than hook a lower-dimensional array up to the current point.
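One way to sketch that second idea, a dask array defined over a verification computation so each chunk is evaluated only when napari slices it. The `abs_error` metric here is a hypothetical stand-in for whatever scores would actually provide:

```python
import numpy as np
import dask.array as da

def abs_error(model_block, obs_block):
    # Hypothetical stand-in for a `scores` verification metric.
    return np.abs(model_block - obs_block)

# Toy (time, lat, lon) model and observation arrays, chunked per timepoint.
model = da.from_array(np.linspace(0.0, 1.0, 64).reshape(4, 4, 4),
                      chunks=(1, 4, 4))
obs = da.zeros((4, 4, 4), chunks=(1, 4, 4))

# A lazy "verification layer": nothing is computed until a chunk is needed.
err = da.map_blocks(abs_error, model, obs, dtype=float)

# Materialising one timepoint triggers the metric for that chunk only --
# this is the slice napari would request when the time slider moves.
frame = np.asarray(err[0])
```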

jni commented Nov 29, 2024

(Also btw — no rush on this, other than the excitement of finding a cool reason to collaborate. 😃 As I mentioned over Signal, I'm happy to come in to Melb some day to continue sprinting, when the time is right for you!)

tennlee commented Nov 29, 2024

Cool. Well I've done the setup and reproduced the issue, which is a good first step.

tennlee commented Nov 29, 2024

Okay, so I don't know exactly why, but I achieved

[image]

by removing the chunks argument from the original file load. napari may need the chunking in that form, but the interpolation is quick without it.

I'll try loading the resulting thing into napari and testing.

(update - napari seems perfectly happy)

jni commented Nov 29, 2024

The chunking is totally optional. I'm not surprised at those results, @tennlee, because without the chunks argument:

  • ds is loaded completely into memory as a NumPy rather than a dask array, and
  • ds_reg is computed eagerly (the ds_reg = ... call takes a long time, yes?) and uses a lot of memory in the process. I think you are just front-loading the slowness.

At that point you have a NumPy array, and everything should indeed be super fast. But your process will be using heaps of RAM, which is undesirable.

tennlee commented Nov 30, 2024

jni commented Dec 9, 2024

Oof, found the culprit:

The current code also has the unfortunate side-effect of merging all chunks too.

from pydata/xarray#6799

jni commented Dec 9, 2024

Given this limitation, I think the solution for making this example work nicely is to save the interpolated array to zarr and host it somewhere.
